Abstract

This study investigates the distribution of Dormibacterota, a bacterial phylum, across National Ecological Observatory Network (NEON) sites, and the taxonomic composition of one site, Guánica State Forest. Dormibacterota remain poorly understood, in part because of their low abundance relative to other bacterial phyla. Using R for data analysis, this project presents taxonomic breakdowns that give insight into the abundance of Dormibacterota across sites and into the diversity within the tropical forest ecosystem of Guánica State Forest. Dormibacterota composition varied among sites, suggesting that the phylum may favor certain environmental conditions. Within Guánica State Forest, distinct taxonomic profiles were observed, with relatively low diversity at that site. Together, these results offer a starting point for understanding the ecological roles of Dormibacterota in different ecosystems and the microbial colonization that is possible in tropical forest ecosystems.

Motivating Reasons

The motivation for this project was to analyze the diversity of taxonomic groups at Guánica State Forest. Knowing which taxonomic groups are present helps researchers understand which organisms can grow in a given region and how that depends on its geography. Analyzing the sites where Dormibacterota were found likewise shows where this phylum tends to live or thrive. Learning to analyze externally collected data is a valuable skill for early-career researchers. Data analysis also lets researchers recognize patterns that hint at relationships between environments and the organisms living in them. Research like this can also identify species-rich environments so that they can be protected and used for further study.

Introduction

Guánica State Forest is a subtropical dry forest in southwest Puerto Rico and is the best preserved dry forest in the Caribbean. It has a warm climate with two rainy/hurricane seasons. It is home to over 700 plant species, which fall into three forest types: deciduous forest, semi-evergreen forest, and scrub forest. Its most famous plant is a guaiac wood tree that may be as old as 1,000 years. The site contains several different ecosystems, including beaches, coral reefs, salt flats, mangrove forests, and limestone caverns (Sotomayor-Mena and Rios-Velazquez, 2020). Half of Puerto Rico's birds occur in Guánica State Forest, and it is one of the few habitats where Cook's pallid anole, a lizard species, can be found. The forest supports both marine and terrestrial wildlife, including corals, birds, grasshoppers, and ants.

Dormibacterota are a phylum of uncultured, oligotrophic bacteria that live in soil and are typically found in cold deserts. They are known for survival mechanisms that allow them to persist under starvation conditions. They are thought to be aerobic heterotrophs and, based on genome analysis, can synthesize, store, and break down glycogen (Montgomery et al., 2021). The phylum is not well studied, in part because its members are most commonly found in extremely cold environments. Ongoing research is examining the phylogenetic relationships of Dormibacterota and their contributions to the environments in which they live.

Methods

Data Acquisition and Preparation

Data Collection: Taxonomic data can be obtained from sources such as biodiversity databases, field surveys, or existing literature. I obtained my data from the National Ecological Observatory Network (NEON).

Data Cleaning and Formatting: I cleaned the NEON data to remove inconsistencies, missing values, and errors. I then formatted the data for analysis in RStudio by reducing the data sets to a workable size and keeping only the columns I wanted to analyze.

Data Exploration and Visualization

Exploratory Data Analysis: I explored the taxonomic data to understand its structure, distribution, and characteristics, using histograms, bar graphs, and box plots to visualize these features.

Software and Packages

RStudio: I performed all analyses in R using RStudio and pushed projects to GitHub for storage and collaboration.

R Packages: I used several R packages for the taxonomic analysis, including tidyverse, ggtree, and data.table.

Results

Dormibacterota Across All NEON Sites

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggtree)
## ggtree v3.10.1 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
## for mapping and visualizing associated data on phylogeny using ggtree.
## Molecular Biology and Evolution. 2018, 35(12):3041-3043.
## doi:10.1093/molbev/msy194
## 
## S Xu, Z Dai, P Guo, X Fu, S Liu, L Zhou, W Tang, T Feng, M Chen, L
## Zhan, T Wu, E Hu, Y Jiang, X Bo, G Yu. ggtreeExtra: Compact
## visualization of richly annotated phylogenetic data. Molecular Biology
## and Evolution. 2021, 38(9):4039-4042. doi: 10.1093/molbev/msab166
## 
## Attaching package: 'ggtree'
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
library(TDbook) 
library(ggimage)
library(rphylopic)
## You are using rphylopic v.1.4.0. Please remember to credit PhyloPic contributors (hint: `get_attribution()`) and cite rphylopic in your work (hint: `citation("rphylopic")`).
## 
## Attaching package: 'rphylopic'
## 
## The following object is masked from 'package:ggimage':
## 
##     geom_phylopic
library(treeio)
## treeio v1.26.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR
## Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package
## for phylogenetic tree input and output with richly annotated and
## associated data. Molecular Biology and Evolution. 2020, 37(2):599-603.
## doi: 10.1093/molbev/msz240
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
library(tidytree)
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu. Using ggtree to visualize data on tree-like structures.
## Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 
## Guangchuang Yu.  Data Integration, Manipulation and Visualization of
## Phylogenetic Trees (1st edition). Chapman and Hall/CRC. 2022,
## doi:10.1201/9781003279242
## 
## Attaching package: 'tidytree'
## 
## The following object is masked from 'package:treeio':
## 
##     getNodeNum
## 
## The following object is masked from 'package:stats':
## 
##     filter
library(ape)
## 
## Attaching package: 'ape'
## 
## The following objects are masked from 'package:tidytree':
## 
##     drop.tip, keep.tip
## 
## The following object is masked from 'package:treeio':
## 
##     drop.tip
## 
## The following object is masked from 'package:ggtree':
## 
##     rotate
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(TreeTools)
## 
## Attaching package: 'TreeTools'
## 
## The following object is masked from 'package:tidytree':
## 
##     MRCA
## 
## The following object is masked from 'package:treeio':
## 
##     MRCA
## 
## The following object is masked from 'package:ggtree':
## 
##     MRCA
library(phytools)
## Loading required package: maps
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
## 
## 
## Attaching package: 'phytools'
## 
## The following object is masked from 'package:TreeTools':
## 
##     as.multiPhylo
## 
## The following object is masked from 'package:treeio':
## 
##     read.newick
library(ggnewscale)
library(ggtreeExtra)
## ggtreeExtra v1.12.0 For help: https://yulab-smu.top/treedata-book/
library(ggstar)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON.csv")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NEON_MAGs)
## # A tibble: 6 × 19
##   `Bin ID`      `Genome Name`        `IMG Genome ID` `Bin Quality` `Bin Lineage`
##   <chr>         <chr>                          <dbl> <chr>         <chr>        
## 1 3300060643_14 Terrestrial soil mi…      3300060643 MQ            <NA>         
## 2 3300060643_16 Terrestrial soil mi…      3300060643 MQ            Bacteria     
## 3 3300060643_18 Terrestrial soil mi…      3300060643 MQ            Bacteria; Ac…
## 4 3300060643_2  Terrestrial soil mi…      3300060643 MQ            Bacteria; Ac…
## 5 3300060643_28 Terrestrial soil mi…      3300060643 MQ            Bacteria; Ps…
## 6 3300060643_35 Terrestrial soil mi…      3300060643 MQ            Bacteria; Ac…
## # ℹ 14 more variables: `GTDB-Tk Taxonomy Lineage` <chr>, `Bin Methods` <chr>,
## #   `Created By` <chr>, `Date Added` <date>, `Bin Completeness` <dbl>,
## #   `Bin Contamination` <dbl>, `Total Number of Bases` <dbl>, `5s rRNA` <dbl>,
## #   `16s rRNA` <dbl>, `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>,
## #   `Scaffold Count` <dbl>, `GOLD Study ID` <chr>
str(NEON_MAGs)
## spc_tbl_ [1,754 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Bin ID                  : chr [1:1754] "3300060643_14" "3300060643_16" "3300060643_18" "3300060643_2" ...
##  $ Genome Name             : chr [1:1754] "Terrestrial soil microbial communities from National Grasslands LBJ, Texas, USA - CLBJ_001-M-20210506-comp-1" "Terrestrial soil microbial communities from National Grasslands LBJ, Texas, USA - CLBJ_001-M-20210506-comp-1" "Terrestrial soil microbial communities from National Grasslands LBJ, Texas, USA - CLBJ_001-M-20210506-comp-1" "Terrestrial soil microbial communities from National Grasslands LBJ, Texas, USA - CLBJ_001-M-20210506-comp-1" ...
##  $ IMG Genome ID           : num [1:1754] 3.3e+09 3.3e+09 3.3e+09 3.3e+09 3.3e+09 ...
##  $ Bin Quality             : chr [1:1754] "MQ" "MQ" "MQ" "MQ" ...
##  $ Bin Lineage             : chr [1:1754] NA "Bacteria" "Bacteria; Actinomycetota; Actinomycetes" "Bacteria; Actinomycetota; Actinomycetes" ...
##  $ GTDB-Tk Taxonomy Lineage: chr [1:1754] "Bacteria; Acidobacteriota; Blastocatellia; Pyrinomonadales; Pyrinomonadaceae; PSRF01" "Bacteria; Acidobacteriota; Vicinamibacteria; Vicinamibacterales; UBA2999; Gp6-AA45" "Bacteria; Actinobacteriota; Actinomycetia; Streptosporangiales; Streptosporangiaceae; Chersky-822" "Bacteria; Actinobacteriota; Actinomycetia; Mycobacteriales; Jatrophihabitantaceae; JAFAWL01" ...
##  $ Bin Methods             : chr [1:1754] "MetaBAT v2:2.15, CheckM v1.2.1, GTDB-tk v2.1.1, GTDB database release R207_v2" "MetaBAT v2:2.15, CheckM v1.2.1, GTDB-tk v2.1.1, GTDB database release R207_v2" "MetaBAT v2:2.15, CheckM v1.2.1, GTDB-tk v2.1.1, GTDB database release R207_v2" "MetaBAT v2:2.15, CheckM v1.2.1, GTDB-tk v2.1.1, GTDB database release R207_v2" ...
##  $ Created By              : chr [1:1754] "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" "IMG_PIPELINE" ...
##  $ Date Added              : Date[1:1754], format: "2023-04-06" "2023-04-06" ...
##  $ Bin Completeness        : num [1:1754] 96.2 77.5 77.2 58.4 68.7 ...
##  $ Bin Contamination       : num [1:1754] 2.56 5.3 1.99 3.74 4.67 0 2.97 3.16 1.71 5.17 ...
##  $ Total Number of Bases   : num [1:1754] 6247032 5394623 4389455 3228217 3245901 ...
##  $ 5s rRNA                 : num [1:1754] 0 0 0 0 0 1 3 0 1 0 ...
##  $ 16s rRNA                : num [1:1754] 1 0 0 0 0 0 1 1 0 0 ...
##  $ 23s rRNA                : num [1:1754] 0 0 0 0 0 1 1 0 1 0 ...
##  $ tRNA Genes              : num [1:1754] 54 32 35 29 12 26 24 37 47 34 ...
##  $ Gene Count              : num [1:1754] 5373 5406 4705 3762 3446 ...
##  $ Scaffold Count          : num [1:1754] 39 878 607 592 474 386 270 547 10 186 ...
##  $ GOLD Study ID           : chr [1:1754] "Gs0161344" "Gs0161344" "Gs0161344" "Gs0161344" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Bin ID` = col_character(),
##   ..   `Genome Name` = col_character(),
##   ..   `IMG Genome ID` = col_double(),
##   ..   `Bin Quality` = col_character(),
##   ..   `Bin Lineage` = col_character(),
##   ..   `GTDB-Tk Taxonomy Lineage` = col_character(),
##   ..   `Bin Methods` = col_character(),
##   ..   `Created By` = col_character(),
##   ..   `Date Added` = col_date(format = ""),
##   ..   `Bin Completeness` = col_double(),
##   ..   `Bin Contamination` = col_double(),
##   ..   `Total Number of Bases` = col_double(),
##   ..   `5s rRNA` = col_double(),
##   ..   `16s rRNA` = col_double(),
##   ..   `23s rRNA` = col_double(),
##   ..   `tRNA Genes` = col_double(),
##   ..   `Gene Count` = col_double(),
##   ..   `Scaffold Count` = col_double(),
##   ..   `GOLD Study ID` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
NEON_MAGs_Ind <- NEON_MAGs %>% 
  filter(`Genome Name` != "NEON combined assembly") 

NEON_MAGs_Ind_tax <- NEON_MAGs_Ind %>% 
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE)
## Warning: Expected 6 pieces. Additional pieces discarded in 21 rows [12, 32, 66, 79, 80,
## 88, 96, 102, 104, 240, 334, 386, 657, 790, 846, 931, 943, 983, 1041, 1095,
## ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 282 rows [6, 7, 42, 49,
## 50, 55, 60, 83, 85, 97, 100, 105, 107, 113, 114, 116, 119, 125, 129, 130, ...].
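The two warnings above arise because some lineages have more than six ranks and others have fewer. The following is a minimal sketch of how `separate()` treats each case, using a small hypothetical tibble. The first lineage string is copied from the data preview; the other two, including the names `OrderX`, `FamilyX`, `GenusX`, and `SpeciesX`, are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical lineages: one with 6 ranks, one with 7 (an extra species rank),
# and one with only 3 ranks
toy <- tibble(
  lineage = c(
    "Bacteria; Acidobacteriota; Blastocatellia; Pyrinomonadales; Pyrinomonadaceae; PSRF01",
    "Bacteria; Dormibacterota; Dormibacteria; OrderX; FamilyX; GenusX; SpeciesX",
    "Bacteria; Dormibacterota; Dormibacteria"
  )
)

ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus")

# extra = "drop" discards pieces beyond the sixth without a warning;
# fill = "right" pads short lineages with NA on the right without a warning
toy_tax <- toy %>%
  separate(lineage, ranks, sep = "; ", remove = FALSE, extra = "drop", fill = "right")

toy_tax
```

Setting `extra` and `fill` explicitly gives the same columns as the default behavior but makes the handling of long and short lineages a deliberate choice instead of a warning.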

All Phyla Counts

kable(
  NEON_MAGs_Ind_tax %>% 
    count(Phylum, sort = TRUE)
)
|Phylum              |   n|
|:-------------------|---:|
|Actinobacteriota    | 418|
|Proteobacteria      | 248|
|Acidobacteriota     | 181|
|Verrucomicrobiota   |  57|
|NA                  |  38|
|Chloroflexota       |  35|
|Myxococcota         |  29|
|Bacteroidota        |  22|
|Gemmatimonadota     |  16|
|Methylomirabilota   |  16|
|Planctomycetota     |  16|
|Dormibacterota      |  11|
|Eremiobacterota     |  11|
|Desulfobacterota_B  |   9|
|Desulfobacterota    |   5|
|Patescibacteria     |   5|
|Tectomicrobia       |   3|
|Cyanobacteria       |   2|
|Myxococcota_A       |   2|
|Armatimonadota      |   1|
|Chlamydiota         |   1|
|Eisenbacteria       |   1|
|Firmicutes          |   1|
|Krumholzibacteriota |   1|
|Nitrospirota        |   1|
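To put the Dormibacterota count in context, the counts above can be converted to percentages of all individual-assembly MAGs. This is a minimal sketch that re-enters the counts from the table by hand; the object names `phylum_counts` and `phylum_share` are illustrative and not part of the original analysis.

```r
library(dplyr)

# Phylum counts copied from the table above (individual assemblies only)
phylum_counts <- tibble(
  Phylum = c("Actinobacteriota", "Proteobacteria", "Acidobacteriota",
             "Verrucomicrobiota", NA, "Chloroflexota", "Myxococcota",
             "Bacteroidota", "Gemmatimonadota", "Methylomirabilota",
             "Planctomycetota", "Dormibacterota", "Eremiobacterota",
             "Desulfobacterota_B", "Desulfobacterota", "Patescibacteria",
             "Tectomicrobia", "Cyanobacteria", "Myxococcota_A",
             "Armatimonadota", "Chlamydiota", "Eisenbacteria", "Firmicutes",
             "Krumholzibacteriota", "Nitrospirota"),
  n = c(418, 248, 181, 57, 38, 35, 29, 22, 16, 16, 16, 11, 11, 9, 5, 5,
        3, 2, 2, 1, 1, 1, 1, 1, 1)
)

# Share of each phylum as a percentage of all counted MAGs
phylum_share <- phylum_counts %>%
  mutate(percent = round(100 * n / sum(n), 2))

# Row for the phylum of interest
phylum_share %>%
  filter(Phylum == "Dormibacterota")
```

With these counts, Dormibacterota account for 11 of 1,130 MAGs, or just under 1 percent, which is consistent with the low abundance described in the abstract.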
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON.csv") %>% 
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>% 
  # create a new column with the Assembly Type ("Combined" or "Individual")
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ "Combined",
                            TRUE ~ "Individual")) %>% 
  # split the GTDB-Tk lineage into one column per taxonomic rank
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>% 
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Use the " - " delimiter to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Warning: Expected 6 pieces. Additional pieces discarded in 29 rows [12, 32, 66, 79, 80,
## 88, 96, 102, 104, 240, 334, 386, 657, 790, 846, 931, 943, 983, 1041, 1095,
## ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 429 rows [6, 7, 42, 49,
## 50, 55, 60, 83, 85, 97, 100, 105, 107, 113, 114, 116, 119, 125, 129, 130, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
NEON_MAGs_bact_ind <- NEON_MAGs %>%
  filter(Domain == "Bacteria") %>%
  filter(`Assembly Type` == "Individual")
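The parsing steps above are easier to follow on a single value. This is a minimal sketch that applies the same sequence to one genome name taken from the data preview and to the combined-assembly label. It uses `str_remove()` inside `mutate()` in place of `mutate_at()`, and the object names `toy` and `parsed` are illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

toy <- tibble(
  `Genome Name` = c(
    "Terrestrial soil microbial communities from National Grasslands LBJ, Texas, USA - CLBJ_001-M-20210506-comp-1",
    "NEON combined assembly"
  )
)

parsed <- toy %>%
  # drop the common prefix
  mutate(`Genome Name` = str_remove(`Genome Name`, "Terrestrial soil microbial communities from ")) %>%
  # split site description from sample name at " - "
  separate(`Genome Name`, c("Site", "Sample Name"), sep = " - ", fill = "right") %>%
  # drop the "-comp-1" suffix
  mutate(`Sample Name` = str_remove(`Sample Name`, "-comp-1")) %>%
  # split the sample name into site ID and plot info, keeping the sample name
  separate(`Sample Name`, c("Site ID", "subplot.layer.date"), sep = "_",
           remove = FALSE, fill = "right") %>%
  # split the plot info into subplot, soil layer, and date
  separate(subplot.layer.date, c("Subplot", "Layer", "Date"), sep = "-", fill = "right")

parsed
```

The combined-assembly row has no " - " delimiter, so its sample fields come out as `NA`. Rows like this are a plausible source of the "Expected 2 pieces" warning shown above, and they are removed later by filtering on `Assembly Type`.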

Phyla with Dormibacterota Filtered

kable(
  NEON_MAGs_Ind_tax %>% 
    # keep only the Dormibacterota MAGs, then count them
    filter(Phylum == "Dormibacterota") %>% 
    count(Phylum)
)

|Phylum         |  n|
|:--------------|--:|
|Dormibacterota | 11|
NEON_MAGs_bact_ind %>%
  ggplot(aes(x = Phylum)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Phyla Counts Across All Sites")

NEON_MAGs_bact_ind %>%
  ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Phyla Counts Labeled by Site")
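The bar order in the plot above comes from `fct_rev(fct_infreq(Phylum))`. This is a minimal sketch of what those two forcats functions do to the factor levels, using a short hypothetical vector of phylum names.

```r
library(forcats)

# Hypothetical vector: 3 Actinobacteriota, 2 Proteobacteria, 1 Dormibacterota
phyla <- c("Dormibacterota", "Proteobacteria", "Proteobacteria",
           "Actinobacteriota", "Actinobacteriota", "Actinobacteriota")

# fct_infreq() orders levels from most to least frequent
by_freq <- fct_infreq(phyla)
levels(by_freq)

# fct_rev() reverses that order, so that after coord_flip()
# the most frequent phylum is drawn at the top of the chart
levels(fct_rev(by_freq))
```

Without the reversal, `coord_flip()` would place the first level at the bottom of the axis, so the most abundant phylum would appear last.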

NEON_MAGs_bact_ind %>% 
ggplot(aes(x = Phylum)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  facet_wrap(vars(Site), scales = "free", ncol = 2) +
  labs(title = "Phyla Counts Separated Out by Site")

NEON_MAGs_bact_ind %>%   
ggplot(aes(x = fct_infreq(Phylum), y = `Total Number of Bases`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1)) +
  labs(title = "Total Number of Nucleotide Bases for each Major Phylum")

NEON_MAGs_bact_ind %>%
  ggplot(aes(x = Subplot, color = `Site ID`, fill = `Site ID`)) +
  geom_bar() +
  coord_flip() +
  labs (title = "Subplot Count Colored by Site ID")

NEON_MAGs_bact_ind %>%
  ggplot(aes(x = Site, fill = Phylum)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Phyla Counts at Various Sites, Colored by Phylum")

NEON_MAGs_bact_ind %>%
  ggplot(aes(x = `Total Number of Bases`, y = `Gene Count`, color = Phylum)) +
  geom_point() +
  coord_flip() +
  labs(title = "Gene Count vs Total Number of Bases At All Sites, Colored by Phylum")

NEON_MAGs_GSF <- NEON_MAGs %>%
  filter(str_detect(`Site`, "Guanica State Forest and Biosphere Reserve, Puerto Rico"))
NEON_MAGs_D <- NEON_MAGs %>%
  filter(str_detect(`GTDB-Tk Taxonomy Lineage`, "Dormibacterota"))
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>% 
  rename(`Genome Name` = `Genome Name / Sample Name`) %>% 
  filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>% 
  filter(str_detect(`Genome Name`, 'WREF plot', negate = T))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>% 
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Use the " - " delimiter to split the column in two
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>% 
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "") 
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr   (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl  (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date  (1): collectionDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_FULL <- NEON_MAGs %>% 
  left_join(NEON_metagenomes, by = c("Sample Name")) %>%
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))
NEON_FULL_D <- NEON_FULL %>%
  filter(str_detect(`Phylum`,"Dormibacterota" ))
NEON_FULL_D %>%   
ggplot(aes(x = `Site.x`, y = `soilInWaterpH`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle=50, vjust=1, hjust=1)) +
  labs(title = "Soil Water pH Across Sites, Specific to Dormibacterota")
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
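A warning like the one above means that some rows had no pH value to plot, which can happen when a MAG has no matching soil chemistry record after the join. This is a minimal sketch of how to check for that, using small hypothetical tables. The sample names follow the format used in this report, but the bin IDs and the pH value are invented for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical MAG table; the third row mimics a combined-assembly MAG with no sample name
mags <- tibble(
  `Bin ID` = c("bin_1", "bin_2", "bin_3"),
  `Sample Name` = c("GUAN_007-M-20210922", "TOOL_002-O-20210804", NA)
)

# Hypothetical chemistry table with the "-COMP" suffix removed so the keys can match
chem <- tibble(
  genomicsSampleID = "GUAN_007-M-20210922-COMP",
  soilInWaterpH = 7.9
) %>%
  mutate(genomicsSampleID = str_remove(genomicsSampleID, "-COMP"))

joined <- mags %>%
  left_join(chem, by = c("Sample Name" = "genomicsSampleID"))

# Number of rows with no pH value after the join
n_missing <- sum(is.na(joined$soilInWaterpH))
n_missing

# MAGs that have no matching chemistry record
unmatched <- mags %>%
  anti_join(chem, by = c("Sample Name" = "genomicsSampleID"))
unmatched
```

Comparing `n_missing` with the number of rows in the filtered data shows how much of a plot is actually backed by chemistry measurements.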

NEON_FULL_D %>%
  ggplot(aes(x = `Bin Contamination`)) +
  geom_bar() +
  labs(title = "Dormibacterota Bin Contamination Counts")
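Because bin contamination is a continuous percentage, `geom_bar()` draws one bar per distinct value. A histogram that groups values into bins is one alternative. This is a minimal sketch under that assumption; the contamination values are copied from the data previews in this report and are not specific to Dormibacterota, and the one-percent bin width is an arbitrary choice.

```r
library(ggplot2)

# A few contamination values (percent) copied from the previews in this report
toy <- data.frame(
  BinContamination = c(0.97, 1.46, 0.00, 9.48, 2.56, 5.30, 1.99, 3.74)
)

# Group contamination values into 1-percent-wide bins starting at 0
p <- ggplot(toy, aes(x = BinContamination)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  labs(title = "Bin Contamination (example values)",
       x = "Bin Contamination (%)", y = "Number of MAGs")

p
```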

ggplot(data = NEON_FULL_D, aes(x = `Ecosystem Subtype`, y = `soilTemp`)) +
  geom_point(aes(color = Order)) +
  labs(title = "Ecosystem Subtype vs Temperature Colored by Order")
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>% 
  left_join(NEON_metagenomes, by = "Sample Name") %>% 
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>% 
  rename("label" = "Bin ID")
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
# Look up the node number of the Dormibacterota clade among the tip and node labels
node_vector_bac <- c(tree_bac$tip.label,tree_bac$node.label)
grep("Dormibacterota", node_vector_bac, value = TRUE)
## [1] "'1.0:p__Dormibacterota; c__Dormibacteria'"
match(grep("Dormibacterota", node_vector_bac, value = TRUE), node_vector_bac)
## [1] 1767
# Extract the Dormibacterota subtree at that node
tree_bac_preorder <- Preorder(tree_bac)
tree_Dormibacterota <- Subtree(tree_bac_preorder, 1767)
ggtree(tree_Dormibacterota)  %<+%
  NEON_MAGs_metagenomes_chemistry + 
  geom_tiplab(size=2, hjust=-.1) +
  xlim(0,20) +
  geom_point(mapping=aes(color=`Ecosystem Subtype`)) +
  labs(title = "Dormibacterota Ecosystem Subtype Displayed Using Phylogenetic Tree")

NEON_MAGs_Dormibacterota <- NEON_MAGs_metagenomes_chemistry %>% 
  filter(Phylum == "Dormibacterota") 
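The node number 1767 used for the Dormibacterota clade comes from matching a label against the combined vector of tip and node labels. This is a minimal sketch of that lookup on a hypothetical four-tip tree. The tree, its labels, and the object names are invented for illustration, and `ape::extract.clade()` is used here in place of `TreeTools::Subtree()` to keep the example to one package.

```r
library(ape)

# Hypothetical four-tip tree with labeled internal nodes
toy_tree <- read.tree(text = "((A:1,B:1)Alpha:1,(C:1,D:1)Beta:1)Root;")

# In a phylo object, tips are numbered first and internal nodes after them,
# so a label's position in this combined vector is its node number
labels <- c(toy_tree$tip.label, toy_tree$node.label)
beta_node <- match(grep("Beta", labels, value = TRUE), labels)
beta_node

# Extract the clade below that node and list its tips
beta_clade <- extract.clade(toy_tree, beta_node)
beta_clade$tip.label
```

Storing the result of the lookup in a variable, as with `beta_node` here, avoids hard-coding a node number that could change if the tree file is regenerated.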
ggtree(tree_bac, layout="circular", branch.length="none") +
  geom_hilight(node=1767, fill="steelblue", alpha=.6) +
  geom_cladelab(node=1767, label="Dormibacterota", align=TRUE, offset = 0, textcolor='steelblue', barcolor='steelblue') +
  geom_hilight(node=1789, fill="darkgreen", alpha=.6) +
  geom_cladelab(node=1789, label="Actinomycetota", align=TRUE, vjust=-0.4, offset = 0, textcolor='darkgreen', barcolor='darkgreen') +
  geom_hilight(node=2673, fill="darkorange", alpha=.6) +
  geom_cladelab(node=2673, label="Acidobacteriota", align=TRUE, hjust=1.1, offset = 0, textcolor='darkorange', barcolor='darkorange') +
  labs(title = "Circular Phylogenetic Tree Showing Dormibacterota in Relation to Actinomycetota and Acidobacteriota")

NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>% 
  rename("AssemblyType" = "Assembly Type") %>% 
  rename("BinCompleteness" = "Bin Completeness") %>% 
  rename("BinContamination" = "Bin Contamination") %>% 
  rename("TotalNumberofBases" = "Total Number of Bases") %>% 
  rename("EcosystemSubtype" = "Ecosystem Subtype")
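The renaming above removes blank spaces one column at a time. As a possible shortcut, `rename_with()` can apply the same change to every column name in one step. This is a minimal sketch on a hypothetical one-row tibble, with values taken from the data preview and object names chosen for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical one-row table with spaces in the column names
toy <- tibble(
  `Bin Completeness` = 96.2,
  `Bin Contamination` = 2.56,
  `Ecosystem Subtype` = "Grasslands"
)

# Remove every space from every column name
toy_noblank <- toy %>%
  rename_with(~ str_remove_all(.x, " "))

names(toy_noblank)
```

Unlike the explicit `rename()` calls, this changes all columns whose names contain spaces, so any later code that refers to names such as `Sample Name` would need to use the new names. The `.cols` argument of `rename_with()` can restrict the change to selected columns.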

ggtree(tree_Dormibacterota)  %<+%
  NEON_MAGs_metagenomes_chemistry + 
  geom_tippoint(aes(colour=`Ecosystem Subtype`)) + 

# For unknown reasons the following does not like blank spaces in the names
  geom_facet(panel = "Bin Completeness", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_point, 
      mapping=aes(x = BinCompleteness)) +
  geom_facet(panel = "Bin Contamination", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_col, 
                aes(x = BinContamination), orientation = 'y', width = .6) +
  theme_tree2(legend.position=c(.1, .7)) +
  labs(title = "Phylogenetic Tree Displaying Ecosystem Subtypes, Bin Completeness Counts, and Bin Contamination Counts")

ggtree(tree_Dormibacterota, layout="circular")  %<+%
  NEON_MAGs_metagenomes_chemistry + 
  geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`)) +
  labs(title = "Circular Dormibacterota Phylogenetic Tree Displaying Total Number of Bases and Ecosystem Subtype")
## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).

NEON_MAGs_Dormibacterota %>%
  ggplot(aes(x=`Ecosystem Subtype`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Ecosystem Subtypes where Dormibacterota are Found")

Bacterial Genomes at Guanica State Forest

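# Keep the sample name, site ID, and ecosystem subtype for every sample (all sites, including GUAN)
NEON_metagenomes_GUAN <- NEON_metagenomes %>%
  select(c(`Sample Name`, `Site ID`, `Ecosystem Subtype`))
kable(NEON_metagenomes_GUAN)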
|Sample Name         |Site ID |Ecosystem Subtype   |
|:-------------------|:-------|:-------------------|
|CLBJ_006-M-20210506 |CLBJ    |Grasslands          |
|CLBJ_002-M-20210506 |CLBJ    |Grasslands          |
|WOOD_004-M-20210714 |WOOD    |Wetlands            |
|TOOL_002-O-20210804 |TOOL    |Tundra              |
|WREF_004-M-20210622 |WREF    |Temperate forest    |
|TEAK_004-O-20210726 |TEAK    |Temperate forest    |
|HEAL_048-M-20210622 |HEAL    |Boreal forest/Taiga |
|KONZ_043-M-20210721 |KONZ    |Grasslands          |
|YELL_048-M-20210707 |YELL    |Temperate forest    |
|TOOL_041-O-20210803 |TOOL    |Tundra              |
|TOOL_003-O-20210805 |TOOL    |Tundra              |
|GUAN_048-M-20210920 |GUAN    |Tropical forest     |
|WREF_004-O-20210622 |WREF    |Temperate forest    |
|SRER_043-M-20210809 |SRER    |Desert              |
|SRER_006-M-20210809 |SRER    |Desert              |
|NIWO_004-M-20210726 |NIWO    |Temperate forest    |
|TEAK_004-M-20210726 |TEAK    |Temperate forest    |
|NIWO_002-M-20210728 |NIWO    |Temperate forest    |
|CLBJ_040-M-20210503 |CLBJ    |Grasslands          |
|ONAQ_004-M-20210525 |ONAQ    |Shrubland           |
|SRER_004-M-20210809 |SRER    |Desert              |
|WREF_073-M-20210623 |WREF    |Temperate forest    |
|YELL_005-M-20210708 |YELL    |Temperate forest    |
|WREF_001-O-20210621 |WREF    |Temperate forest    |
|WOOD_005-M-20210708 |WOOD    |Wetlands            |
|GUAN_042-M-20210920 |GUAN    |Tropical forest     |
|ONAQ_010-M-20210526 |ONAQ    |Shrubland           |
|TOOL_005-O-20210806 |TOOL    |Tundra              |
|TOOL_042-O-20210803 |TOOL    |Tundra              |
|TOOL_006-O-20210804 |TOOL    |Tundra              |
|HEAL_048-O-20210622 |HEAL    |Boreal forest/Taiga |
|YELL_002-M-20210706 |YELL    |Temperate forest    |
|TEAK_025-M-20210726 |TEAK    |Temperate forest    |
|KONZ_024-M-20210719 |KONZ    |Grasslands          |
|WOOD_043-M-20210712 |WOOD    |Wetlands            |
|CLBJ_032-M-20210504 |CLBJ    |Grasslands          |
|YELL_012-O-20210708 |YELL    |Temperate forest    |
|BONA_009-O-20210707 |BONA    |Boreal forest/Taiga |
|TEAK_043-M-20210719 |TEAK    |Temperate forest    |
|SRER_053-M-20210810 |SRER    |Desert              |
|CLBJ_001-M-20210506 |CLBJ    |Grasslands          |
|ONAQ_008-M-20210524 |ONAQ    |Shrubland           |
|GUAN_006-M-20210922 |GUAN    |Tropical forest     |
|CLBJ_033-M-20210505 |CLBJ    |Grasslands          |
|TEAK_002-O-20210720 |TEAK    |Temperate forest    |
|BONA_004-O-20210707 |BONA    |Boreal forest/Taiga |
|WOOD_042-M-20210712 |WOOD    |Wetlands            |
|YELL_016-M-20210708 |YELL    |Temperate forest    |
|KONZ_045-M-20210721 |KONZ    |Grasslands          |
|ONAQ_002-M-20210524 |ONAQ    |Shrubland           |
|GUAN_003-M-20210922 |GUAN    |Tropical forest     |
|SRER_047-M-20210809 |SRER    |Desert              |
|NA                  |NA      |Shrubland           |
|TEAK_005-M-20210728 |TEAK    |Temperate forest    |
|TEAK_005-O-20210728 |TEAK    |Temperate forest    |
|NIWO_001-O-20210728 |NIWO    |Temperate forest    |
|CLBJ_038-M-20210504 |CLBJ    |Grasslands          |
|GUAN_004-M-20210922 |GUAN    |Tropical forest     |
|ONAQ_005-M-20210527 |ONAQ    |Shrubland           |
|YELL_009-M-20210706 |YELL    |Temperate forest    |
|NIWO_004-O-20210726 |NIWO    |Temperate forest    |
|WREF_073-O-20210623 |WREF    |Temperate forest    |
|ONAQ_003-M-20210527 |ONAQ    |Shrubland           |
|WREF_003-O-20210622 |WREF    |Temperate forest    |
|TOOL_044-O-20210803 |TOOL    |Tundra              |
|BONA_001-O-20210708 |BONA    |Boreal forest/Taiga |
|YELL_003-M-20210708 |YELL    |Temperate forest    |
|CLBJ_003-M-20210506 |CLBJ    |Grasslands          |
|WOOD_001-M-20210714 |WOOD    |Wetlands            |
|WOOD_002-M-20210708 |WOOD    |Wetlands            |
|TEAK_003-M-20210726 |TEAK    |Temperate forest    |
|SRER_005-M-20210810 |SRER    |Desert              |
|SRER_052-M-20210810 |SRER    |Desert              |
|KONZ_046-M-20210720 |KONZ    |Grasslands          |
|NIWO_003-M-20210727 |NIWO    |Temperate forest    |
|YELL_046-M-20210705 |YELL    |Temperate forest    |
|BONA_006-O-20210707 |BONA    |Boreal forest/Taiga |
|GUAN_043-M-20210921 |GUAN    |Tropical forest     |
|WOOD_024-O-20210714 |WOOD    |Wetlands            |
|NIWO_005-M-20210726 |NIWO    |Temperate forest    |
|TOOL_004-O-20210805 |TOOL    |Tundra              |
|WREF_003-M-20210622 |WREF    |Temperate forest    |
|TOOL_043-O-20210803 |TOOL    |Tundra              |
|WOOD_024-M-20210714 |WOOD    |Wetlands            |
|YELL_051-M-20210705 |YELL    |Temperate forest    |
|KONZ_042-M-20210720 |KONZ    |Grasslands          |
|GUAN_007-M-20210922 |GUAN    |Tropical forest     |
|WOOD_003-M-20210708 |WOOD    |Wetlands            |
ggplot(data = NEON_metagenomes_GUAN, aes(x = `Site ID`, y = `Ecosystem Subtype`)) +
  geom_point() +
  labs(title = "Ecosystem Subtype at each Site ID, Guanica State Forest = GUAN")

# Horizontal bars: one bar per GTDB-Tk lineage, bar length = number of bins
ggplot(NEON_MAGs_GSF, aes(y = `GTDB-Tk Taxonomy Lineage`)) +
  geom_bar() +
  labs(title = "Count of each Taxonomy Lineage at Guanica State Forest")

NEON_MAGs_GSF %>%
  ggplot(aes(x=`Bin Lineage`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Bin Lineage Counts at Guanica State Forest")

kable(
  NEON_MAGs %>%
    filter(str_detect(`Site`, "Guanica State Forest and Biosphere Reserve, Puerto Rico"))
)
Bin ID Site Sample Name Site ID Subplot Layer Date IMG Genome ID Bin Quality Bin Lineage GTDB-Tk Taxonomy Lineage Domain Phylum Class Order Family Genus Bin Completeness Bin Contamination Total Number of Bases 5s rRNA 16s rRNA 23s rRNA tRNA Genes Gene Count Scaffold Count Assembly Type
3300060854_24 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae NA NA NA NA NA NA NA 56.15 0.97 1533081 1 0 0 25 1741 146 Individual
3300060854_34 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae NA NA NA NA NA NA NA 53.25 1.46 1450069 0 0 0 30 1647 219 Individual
3300060854_37 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae; Nitrososphaera; Candidatus Nitrososphaera evergladensis NA NA NA NA NA NA NA 53.24 0.00 789887 0 1 2 13 1037 51 Individual
3300060854_44 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; Thermoleophilia; Gaiellales; Gaiellaceae; JACDAN01 Bacteria Actinobacteriota Thermoleophilia Gaiellales Gaiellaceae JACDAN01 56.82 9.48 2200962 0 0 0 35 2683 396 Individual
3300060854_5 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Acidimicrobiia; IMCC26256; PALSA-555 Bacteria Actinobacteriota Acidimicrobiia IMCC26256 PALSA-555 NA 52.76 4.27 1397348 0 0 0 21 1718 270 Individual
3300060854_6 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_007-M-20210922 GUAN 007 M 20210922 3300060854 MQ Bacteria; Actinomycetota; Thermoleophilia Bacteria; Actinobacteriota; Thermoleophilia; Solirubrobacterales; 70-9; VGBV01 Bacteria Actinobacteriota Thermoleophilia Solirubrobacterales 70-9 VGBV01 71.49 8.33 2400015 0 1 0 35 2858 362 Individual
3300060887_16 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria; Pseudomonadota; Alphaproteobacteria; Hyphomicrobiales Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Hyphomicrobiaceae; AWTP1-13 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Hyphomicrobiaceae AWTP1-13 54.50 1.88 3392335 0 0 0 18 3738 578 Individual
3300060887_21 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Acidimicrobiia; IMCC26256; PALSA-555 Bacteria Actinobacteriota Acidimicrobiia IMCC26256 PALSA-555 NA 67.66 4.16 2450677 0 0 0 39 2930 396 Individual
3300060887_26 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria; Actinomycetota; Thermoleophilia Bacteria; Actinobacteriota; Thermoleophilia; Solirubrobacterales; 70-9 Bacteria Actinobacteriota Thermoleophilia Solirubrobacterales 70-9 NA 79.35 9.48 2379854 1 0 1 31 2692 259 Individual
3300060887_27 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria Bacteria; Desulfobacterota_B; Binatia; UBA9968; UBA9968; DP-20 Bacteria Desulfobacterota_B Binatia UBA9968 UBA9968 DP-20 65.01 0.65 3097026 0 0 0 18 3210 404 Individual
3300060887_32 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae NA NA NA NA NA NA NA 63.85 0.00 1469352 1 0 1 17 1670 242 Individual
3300060887_39 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Archaea NA NA NA NA NA NA NA 89.81 2.43 4009591 1 0 0 41 4918 399 Individual
3300060887_40 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria Bacteria; Acidobacteriota; Blastocatellia; RBC074; RBC074; JADJLO01 Bacteria Acidobacteriota Blastocatellia RBC074 RBC074 JADJLO01 59.60 7.86 2998531 0 0 0 22 2763 312 Individual
3300060887_5 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_042-M-20210920 GUAN 042 M 20210920 3300060887 MQ Bacteria; Actinomycetota; Thermoleophilia Bacteria; Actinobacteriota; Thermoleophilia; Solirubrobacterales; Thermoleophilaceae; JACVRW01 Bacteria Actinobacteriota Thermoleophilia Solirubrobacterales Thermoleophilaceae JACVRW01 57.81 0.00 2174027 0 0 0 13 2413 343 Individual
3300060888_13 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_003-M-20210922 GUAN 003 M 20210922 3300060888 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae; Nitrososphaera; Candidatus Nitrososphaera evergladensis NA NA NA NA NA NA NA 80.10 2.91 1834459 1 1 1 37 2365 31 Individual
3300060888_15 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_003-M-20210922 GUAN 003 M 20210922 3300060888 MQ Bacteria Bacteria; Chloroflexota; Limnocylindria; QHBO01; QHBO01 Bacteria Chloroflexota Limnocylindria QHBO01 QHBO01 NA 56.14 4.59 1799820 0 0 0 31 2108 328 Individual
3300060888_26 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_003-M-20210922 GUAN 003 M 20210922 3300060888 MQ Bacteria; Actinomycetota; Actinomycetes; Mycobacteriales; Mycobacteriaceae Bacteria; Actinobacteriota; Actinomycetia; Mycobacteriales; Mycobacteriaceae; Mycobacterium Bacteria Actinobacteriota Actinomycetia Mycobacteriales Mycobacteriaceae Mycobacterium 62.27 2.61 4614098 0 0 0 43 5287 731 Individual
3300060898_12 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738; HRBIN12; AC-51 Bacteria Actinobacteriota UBA4738 UBA4738 HRBIN12 AC-51 72.79 9.83 1809081 0 0 0 31 2143 336 Individual
3300060898_28 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Acidimicrobiia; Acidimicrobiales; JACDCH01; ZC4RG19 Bacteria Actinobacteriota Acidimicrobiia Acidimicrobiales JACDCH01 ZC4RG19 50.16 0.00 2356576 0 1 0 12 2576 488 Individual
3300060898_45 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Pseudomonadota; Betaproteobacteria Bacteria; Proteobacteria; Gammaproteobacteria; Burkholderiales; SG8-39; SCGC-AG-212-J23 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales SG8-39 SCGC-AG-212-J23 80.90 3.35 3608050 1 1 0 31 4193 373 Individual
3300060898_54 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738; HRBIN12; DSRY01 Bacteria Actinobacteriota UBA4738 UBA4738 HRBIN12 DSRY01 58.62 8.62 2193175 0 1 0 34 2537 386 Individual
3300060898_8 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738; HRBIN12; DSRY01 Bacteria Actinobacteriota UBA4738 UBA4738 HRBIN12 DSRY01 52.35 6.90 1241702 1 0 1 21 1441 212 Individual
3300060898_9 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_048-M-20210920 GUAN 048 M 20210920 3300060898 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Actinomycetia; Propionibacteriales; Nocardioidaceae Bacteria Actinobacteriota Actinomycetia Propionibacteriales Nocardioidaceae NA 91.29 2.23 4726679 0 1 0 62 5124 430 Individual
3300060914_13 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; Thermoleophilia; Gaiellales; Gaiellaceae; JACDAN01 Bacteria Actinobacteriota Thermoleophilia Gaiellales Gaiellaceae JACDAN01 53.03 0.00 1546924 1 0 1 25 1811 206 Individual
3300060914_14 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae NA NA NA NA NA NA NA 51.63 0.00 906705 1 0 0 16 1054 153 Individual
3300060914_17 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 HQ NA Bacteria; Acidobacteriota; Blastocatellia; Pyrinomonadales; Pyrinomonadaceae; JACMLC01 Bacteria Acidobacteriota Blastocatellia Pyrinomonadales Pyrinomonadaceae JACMLC01 91.45 4.32 4759799 1 1 1 43 4080 47 Individual
3300060914_23 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Actinomycetia; Jiangellales; Jiangellaceae Bacteria Actinobacteriota Actinomycetia Jiangellales Jiangellaceae NA 75.84 2.94 3336565 1 1 1 43 3769 468 Individual
3300060914_25 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Actinomycetes; Propionibacteriales Bacteria; Actinobacteriota; Actinomycetia; Propionibacteriales; Propionibacteriaceae Bacteria Actinobacteriota Actinomycetia Propionibacteriales Propionibacteriaceae NA 57.90 0.00 2403924 0 0 0 22 2610 296 Individual
3300060914_26 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter Bacteria; Actinobacteriota; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; SCSIO-52909 Bacteria Actinobacteriota Rubrobacteria Rubrobacterales Rubrobacteraceae SCSIO-52909 83.19 1.32 2465732 0 0 0 30 2772 268 Individual
3300060914_28 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738 Bacteria Actinobacteriota UBA4738 UBA4738 NA NA 58.12 1.71 1438439 1 2 1 31 1654 149 Individual
3300060914_30 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Archaea NA NA NA NA NA NA NA 96.12 1.94 3938282 1 0 0 43 4828 375 Individual
3300060914_32 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; CADDZG01; WHSQ01; WHSV01 Bacteria Actinobacteriota UBA4738 CADDZG01 WHSQ01 WHSV01 73.70 2.56 2469498 0 0 0 30 2747 288 Individual
3300060914_35 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria Bacteria; Acidobacteriota; Blastocatellia; RBC074; RBC074 Bacteria Acidobacteriota Blastocatellia RBC074 RBC074 NA 63.74 5.90 5269156 0 0 0 24 4815 533 Individual
3300060914_39 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Actinomycetia Bacteria Actinobacteriota Actinomycetia NA NA NA 94.79 2.99 5097796 0 0 0 102 5141 413 Individual
3300060914_41 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria Bacteria; Desulfobacterota_B; Binatia; UBA9968; UBA9968; DP-1 Bacteria Desulfobacterota_B Binatia UBA9968 UBA9968 DP-1 50.54 2.52 2583074 0 0 0 13 2824 279 Individual
3300060914_44 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria Bacteria; Chloroflexota; UBA6077; UBA6077; CF-72 Bacteria Chloroflexota UBA6077 UBA6077 CF-72 NA 72.44 4.17 6734988 1 0 1 46 6737 745 Individual
3300060914_46 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter Bacteria; Actinobacteriota; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; SCSIO-52909 Bacteria Actinobacteriota Rubrobacteria Rubrobacterales Rubrobacteraceae SCSIO-52909 58.04 0.00 2062795 1 0 0 23 2335 236 Individual
3300060914_49 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria Bacteria; Acidobacteriota; Vicinamibacteria; Vicinamibacterales; UBA2999 Bacteria Acidobacteriota Vicinamibacteria Vicinamibacterales UBA2999 NA 66.24 5.13 3149726 0 0 0 15 3046 511 Individual
3300060914_5 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria Bacteria; Acidobacteriota; Vicinamibacteria; Vicinamibacterales; 2-12-FULL-66-21 Bacteria Acidobacteriota Vicinamibacteria Vicinamibacterales 2-12-FULL-66-21 NA 51.90 0.00 1772468 0 0 0 8 1858 350 Individual
3300060914_50 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Acidimicrobiia; IMCC26256; PALSA-555 Bacteria Actinobacteriota Acidimicrobiia IMCC26256 PALSA-555 NA 61.87 2.59 1607679 0 0 0 24 1916 257 Individual
3300060914_52 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae; Nitrososphaera; Candidatus Nitrososphaera evergladensis NA NA NA NA NA NA NA 86.89 2.43 1541281 1 1 1 34 2026 57 Individual
3300060914_53 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae; Nitrososphaera; Candidatus Nitrososphaera evergladensis NA NA NA NA NA NA NA 66.83 2.91 1374179 1 1 1 39 1607 94 Individual
3300060914_55 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 MQ Bacteria; Actinomycetota; Actinomycetes; Propionibacteriales Bacteria; Actinobacteriota; Actinomycetia; Propionibacteriales; Nocardioidaceae Bacteria Actinobacteriota Actinomycetia Propionibacteriales Nocardioidaceae NA 89.84 2.68 3777568 1 2 1 25 4076 429 Individual
3300060914_9 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_043-M-20210921 GUAN 043 M 20210921 3300060914 HQ Bacteria; Actinomycetota; Thermoleophilia; Solirubrobacterales Bacteria; Actinobacteriota; Thermoleophilia; Solirubrobacterales; 70-9; VAYN01 Bacteria Actinobacteriota Thermoleophilia Solirubrobacterales 70-9 VAYN01 98.28 0.86 2383318 1 1 1 52 2505 18 Individual
3300061642_12 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_006-M-20210922 GUAN 006 M 20210922 3300061642 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738; HRBIN12; DSRY01 Bacteria Actinobacteriota UBA4738 UBA4738 HRBIN12 DSRY01 67.22 2.14 1451150 0 0 0 20 1700 256 Individual
3300061642_24 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_006-M-20210922 GUAN 006 M 20210922 3300061642 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738 Bacteria Actinobacteriota UBA4738 UBA4738 NA NA 62.54 8.62 1931241 1 1 1 23 2194 272 Individual
3300061642_25 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_006-M-20210922 GUAN 006 M 20210922 3300061642 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Actinomycetia; Propionibacteriales; Nocardioidaceae Bacteria Actinobacteriota Actinomycetia Propionibacteriales Nocardioidaceae NA 84.71 6.65 3972987 1 1 2 41 4345 516 Individual
3300061642_28 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_006-M-20210922 GUAN 006 M 20210922 3300061642 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae NA NA NA NA NA NA NA 77.02 4.37 1744480 1 0 0 28 1955 221 Individual
3300061643_10 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738; HRBIN12; AC-51 Bacteria Actinobacteriota UBA4738 UBA4738 HRBIN12 AC-51 76.07 2.14 2076534 0 0 0 36 2355 274 Individual
3300061643_15 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Acidimicrobiia; IMCC26256; PALSA-555 Bacteria Actinobacteriota Acidimicrobiia IMCC26256 PALSA-555 NA 59.33 1.99 1787927 0 0 0 26 2107 288 Individual
3300061643_17 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Bacteria; Actinomycetota; Actinomycetes Bacteria; Actinobacteriota; Actinomycetia; Propionibacteriales; Nocardioidaceae Bacteria Actinobacteriota Actinomycetia Propionibacteriales Nocardioidaceae NA 58.44 6.65 2868668 1 0 1 24 3267 618 Individual
3300061643_26 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Bacteria; Actinomycetota Bacteria; Actinobacteriota; UBA4738; UBA4738 Bacteria Actinobacteriota UBA4738 UBA4738 NA NA 67.33 1.42 1697423 1 0 1 25 1948 248 Individual
3300061643_31 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Archaea; Nitrososphaerota; Nitrososphaeria; Nitrososphaerales; Nitrososphaeraceae; Nitrososphaera; Candidatus Nitrososphaera evergladensis NA NA NA NA NA NA NA 54.85 0.00 899286 1 1 1 19 1154 27 Individual
3300061643_33 Guanica State Forest and Biosphere Reserve, Puerto Rico GUAN_004-M-20210922 GUAN 004 M 20210922 3300061643 MQ Bacteria; Verrucomicrobiota Bacteria; Verrucomicrobiota; Verrucomicrobiae; Chthoniobacterales; UBA10450; Udaeobacter Bacteria Verrucomicrobiota Verrucomicrobiae Chthoniobacterales UBA10450 Udaeobacter 59.59 2.36 1609121 0 0 0 21 1734 146 Individual
NEON_MAGs_GSF %>%
  ggplot(aes(x=`Class`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Class Counts at Guanica State Forest")

NEON_MAGs_GSF %>%
  ggplot(aes(x=`Order`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Order Counts at Guanica State Forest")

NEON_MAGs_GSF %>%
  ggplot(aes(x=`Family`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Family Counts at Guanica State Forest")

NEON_MAGs_GSF %>%
  ggplot(aes(x=`Genus`))+ 
  geom_bar()+
  coord_flip() +
  labs(title = "Genus Counts at Guanica State Forest")

NEON_MAGs_GSF %>%
  ggplot(aes(x=`Bin Completeness`, y = `Bin Contamination`))+ 
  geom_point() +
  labs(title = "Bin Completeness Values vs Bin Contamination Values at Guanica State Forest")

Discussion

As mentioned above, the focal site was Guanica State Forest and Biosphere Reserve and the focal phylum was Dormibacterota. At Guanica State Forest, the most abundant domain was Bacteria, the most abundant order was UBA4738, the most abundant family was HRBIN12, and the most abundant genus was DSRY01. The top bin lineage was Bacteria; Actinomycetota, and the top GTDB-Tk taxonomy lineage was Bacteria; Actinobacteriota; Acidimicrobiia; IMCC26256; PALSA-555. The most abundant bacterial phylum at this site was Actinobacteriota. Guanica State Forest is classified as a tropical forest by ecosystem subtype; the other subtypes in the data were wetlands, tundra, temperate forest, shrubland, grasslands, desert, and boreal forest. The ecosystems where Dormibacterota were found were shrublands, grasslands, temperate forest, and boreal forest (in order from highest to lowest count). As can be seen in the circular phylogenetic tree, Dormibacterota is a very small phylum in this dataset. This is further supported by the small number of bases recorded for it in the data. Based on the soil temperature vs. ecosystem subtype graph, it seems to prefer warmer temperatures. Dormibacterota were most abundant at Niwot Ridge, Colorado, with counts at Yellowstone and Denali National Park following closely behind. These results are significant because they provide insight into where Dormibacterota prefer to live and which organisms are most abundant in Guanica State Forest, Puerto Rico. Research on Dormibacterota is very limited, from what I can find, so these results are a good starting point for this evolving area. This data analysis had some limitations because R was difficult to use at times. The learning process may have affected some of the results, since the data may not be represented in the best way possible.
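The "most abundant" order, family, and genus reported here were read off the bar charts. As a cross-check, the same tallies could be computed directly. The following is a minimal sketch, not the method used in this report: `top_taxon` is a hypothetical helper, and `mags_example` is a toy subset whose values are copied from six Guanica rows of the table above (the full analysis would pass `NEON_MAGs_GSF` instead). Bins with no GTDB-Tk classification at a rank (`NA`, as for the archaeal bins) are dropped before counting.

```r
library(dplyr)

# Toy subset: Order/Family/Genus values copied from six Guanica rows of the
# table above. This is an illustrative assumption; the report's full data
# frame is NEON_MAGs_GSF.
mags_example <- tibble(
  Order  = c("UBA4738", "UBA4738", "UBA4738", "Propionibacteriales", "IMCC26256", NA),
  Family = c("HRBIN12", "HRBIN12", "HRBIN12", "Nocardioidaceae", "PALSA-555", NA),
  Genus  = c("DSRY01", "DSRY01", "AC-51", NA, NA, NA)
)

# Hypothetical helper (not part of any package): return the most frequent
# non-missing value in one taxonomic rank column, with its count in `n`.
top_taxon <- function(df, rank) {
  df %>%
    filter(!is.na(.data[[rank]])) %>%        # ignore unclassified bins
    count(.data[[rank]], name = "n", sort = TRUE) %>%
    slice_head(n = 1)                         # keep only the top row
}

top_taxon(mags_example, "Order")
top_taxon(mags_example, "Family")
top_taxon(mags_example, "Genus")
```

If two taxa tie for the highest count, `slice_head()` keeps only one of them, so it is worth inspecting the full `count()` output before reporting a single "most abundant" group.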

Conclusion

This research project looked at the taxonomic groups at Guanica State Forest and at Dormibacterota, in particular, across multiple NEON sites. The data were analyzed and presented, but further research is needed to understand the diversity of Guanica State Forest and the characteristics of Dormibacterota. I think that a deeper dive into Guanica State Forest would benefit the field of genomics because it would allow researchers to better categorize the species there and add detail to the broader-scope research in this project. Visiting the sites where Dormibacterota were found would help researchers understand why this phylum prefers those locations and what it may be contributing to the environment there. Overall, this project provides a good summary of Dormibacterota across sites and of the taxonomic breakdown at Guanica State Forest, but deeper analysis would provide a great deal of insight into all phyla and NEON sites, leading to a more comprehensive overview.

References

Montgomery, Kate, Timothy J. Williams, Merryn Brettle, Jonathan F. Berengut, Angelique E. Ray, Eden Zhang, Julian Zaugg, Philip Hugenholtz, and Belinda C. Ferrari. 2021. “Persistence and Resistance: Survival Mechanisms of Candidatus Dormibacterota from Nutrient-Poor Antarctic Soils.” Environmental Microbiology 23 (8): 4276–94. https://doi.org/10.1111/1462-2920.15610.
Sotomayor-Mena, Roberto G., and Carlos Rios-Velazquez. 2020. “Soil Microbiome Dataset from Guanica Dry Forest in Puerto Rico Generated by Shotgun Sequencing.” Data in Brief 28 (February): 104919. https://doi.org/10.1016/j.dib.2019.104919.